Abstract



We propose a new type of hidden layer for a 
multilayer perceptron, and demonstrate that 
it obtains the best reported performance for 
an MLP on the MNIST dataset. 



1 The piecewise linear activation 
function 

We propose to use a specific kind of piecewise linear 
function as the activation function for a multilayer per- 
ceptron. 

Specifically, suppose that the layer receives as input a
vector $x \in \mathbb{R}^D$. The layer then computes
presynaptic output $z = x^T W + b$, where
$W \in \mathbb{R}^{D \times N}$ and $b \in \mathbb{R}^N$
are learnable parameters of the layer.

We propose to have each layer produce output via the
activation function $h(z)_i = \max_{j \in S_i} z_j$,
where $S_i$ is a different non-empty set of indices
into $z$ for each $i$.
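
As a concrete illustration of the definition above, the
following is a minimal sketch of the layer's forward pass
in plain Python. The function name `piecewise_linear_layer`
and the list-based representation of x, W, b, and the index
sets are assumptions made for illustration; they are not
taken from our implementation.

```python
def piecewise_linear_layer(x, W, b, index_sets):
    """Forward pass of one piecewise linear layer (illustrative sketch).

    x          -- list of D floats (layer input)
    W          -- D x N nested list of floats (weights)
    b          -- list of N floats (biases)
    index_sets -- list of non-empty lists of indices into z, one per output
    """
    D, N = len(W), len(b)
    # Presynaptic output z = x^T W + b.
    z = [sum(x[d] * W[d][n] for d in range(D)) + b[n] for n in range(N)]
    # Each output h_i is the max of z_j over the indices j in S_i.
    return [max(z[j] for j in S) for S in index_sets]
```

For example, with four presynaptic units and the index sets
[[0, 1], [2, 3]], the layer produces two outputs, each the
larger of its two presynaptic values.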

This function provides several benefits: 

• It is similar to the rectified linear units 



• Max pooling over groups of units allows the features
of the network to easily become invariant to some
aspects of their input. For example, if a unit $h_i$
pools (takes the max) over $z_1$, $z_2$, and $z_3$, and
$z_1$, $z_2$, and $z_3$ respond to the same object in
three different positions, then $h_i$ is invariant to
these changes in the object's position. A layer
consisting only of rectifier units can't take the max
over features like this; it can only take their average.

• Max pooling can reduce the total number of parameters
in the network. If we pool with non-overlapping
receptive fields of size $k$, then $h$ has size $N/k$,
and the next layer has its number of weight parameters
reduced by a factor of $k$ relative to a network without
max pooling. This makes the network not only cheaper to
train and evaluate but also more statistically efficient.

• This kind of piecewise linear function can be seen
as letting each unit $h_i$ learn its own activation
function. Given large enough sets $S_i$, $h_i$ can
implement increasingly complex convex functions of
its input. This includes functions that are already
used in other MLPs, such as the rectified linear
function and absolute value rectification.
useful for many classification tasks.

• Unlike rectifier units, every unit is guaranteed to
have some of its parameters receive some training
signal at each update step. This is because the
inputs $z_j$ are only compared to each other, and not
to 0, so one is always guaranteed to be the maximal
element through which the gradient flows. In the case
of rectified linear units, there is only a single
element $z_j$ and it is compared against 0.



In the case when $0 > z_j$, $z_j$ receives no update
signal.
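
Two of the properties listed above can be checked on a
scalar input with a short sketch. It shows that a max over
affine pieces recovers the rectified linear function and
absolute value rectification, and that the gradient of a
max unit always flows to exactly one presynaptic input,
whereas a rectifier passes no gradient when its input is
negative. All names below (`pooled_unit`, `RELU_PIECES`,
`ABS_PIECES`, `max_unit_with_grad`, `relu_with_grad`) are
hypothetical and introduced only for this illustration.
Ties are broken in favor of the first maximal element,
which is also an assumption.

```python
def pooled_unit(x, pieces):
    """Max over affine pieces (w, b) of w * x + b, for a scalar input x."""
    return max(w * x + b for w, b in pieces)

# Special cases: fixed weights that reproduce known activations.
RELU_PIECES = [(1.0, 0.0), (0.0, 0.0)]   # max(x, 0)
ABS_PIECES = [(1.0, 0.0), (-1.0, 0.0)]   # max(x, -x) = |x|

def max_unit_with_grad(z_group):
    """Return h = max_j z_j and the gradient dh/dz_j for each j.

    Exactly one entry of the gradient is 1 (the first maximal element),
    so some parameter always receives a training signal.
    """
    winner = max(range(len(z_group)), key=lambda j: z_group[j])
    grad = [1.0 if j == winner else 0.0 for j in range(len(z_group))]
    return z_group[winner], grad

def relu_with_grad(z):
    """Return max(z, 0) and its gradient, which is 0 whenever z < 0."""
    return max(z, 0.0), (1.0 if z > 0 else 0.0)
```

Even when every input in a pooled group is negative, one
entry of the returned gradient is nonzero, while
`relu_with_grad` returns a zero gradient for any negative
input.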



Preliminary work. 



We used $S_i = \{5i, 5i + 1, \ldots, 5i + 4\}$ in our
experiments. In other words, the activation function
consists of max pooling over non-overlapping groups of
five consecutive presynaptic inputs.
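
The index sets described above can be generated with a
short helper. This is a sketch only: the name
`consecutive_groups` is hypothetical, and it assumes that
$i$ and the indices into $z$ both start at 0.

```python
def consecutive_groups(n_units, k=5):
    """Return the index sets S_i = {k*i, ..., k*i + k - 1}.

    Assumes zero-based indexing and that k divides n_units evenly,
    so the groups are non-overlapping and cover every unit once.
    """
    if n_units % k != 0:
        raise ValueError("n_units must be a multiple of k")
    return [list(range(k * i, k * i + k)) for i in range(n_units // k)]
```

Calling `consecutive_groups(10)` gives two groups, one
covering indices 0 to 4 and one covering indices 5 to 9.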

We apply this activation function to the multilayer
perceptron trained on MNIST by Hinton et al. (2012).
This MLP uses two hidden layers of 1200 units each. In 
our setup, the presynaptic activation z has size 1200 so 
the pooled output of each layer has size 240. The rest 
of our training setup remains unchanged apart from 
adjustment to hyperparameters. 
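
The effect of pooling on the layer sizes quoted above can
be verified with simple arithmetic. The sketch below
counts only the weights feeding the second hidden layer;
the helper name `dense_weight_count` is hypothetical, and
biases are ignored for simplicity.

```python
def dense_weight_count(n_in, n_out):
    """Number of weights in a fully connected layer (biases ignored)."""
    return n_in * n_out

K = 5                # pool size used in our experiments
PRESYNAPTIC = 1200   # size of z in each hidden layer

pooled = PRESYNAPTIC // K  # size of the pooled output h

# Weights into the second hidden layer, with and without pooling.
with_pooling = dense_weight_count(pooled, PRESYNAPTIC)
without_pooling = dense_weight_count(PRESYNAPTIC, PRESYNAPTIC)
reduction = without_pooling // with_pooling
```

Here `reduction` equals the pool size, matching the factor
of $k$ stated in Section 1.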

Hinton et al. (2012) report 110 errors on the test set.
To our knowledge, this is the best published result on 
the MNIST dataset for a method that uses neither 
pretraining nor knowledge of the input geometry. 
It is not clear how Hinton et al. (2012) obtained a
single test set number. We train on the first 50,000
training examples, using the last 10,000 as a validation set.
We use the misclassification rate on the validation set 
to determine at what point to stop training. We then 
record the log likelihood on the first 50,000 examples, 
and continue training but using the full 60,000 example 
training set. When the log likelihood of the validation 
set first exceeds the recorded value of the training set 
log likelihood, we stop training the model, and evalu- 
ate its test set error. Using this approach, our trained 
model made 94 mistakes on the test set. We believe 
this is the best-ever result that does not use pretrain- 
ing or knowledge of the input geometry. 
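
The two-phase stopping procedure described above can be
sketched as follows. This is an illustration under stated
assumptions, not our training code: the five callbacks are
hypothetical stand-ins for the real training and evaluation
routines, and the patience rule used to pick the stopping
point in the first phase is an assumption, since the text
only says that validation misclassification rate determines
when to stop.

```python
def two_phase_training(train_epoch_50k, train_epoch_60k, valid_error,
                       train_log_lik, valid_log_lik,
                       patience=10, max_epochs=1000):
    """Two-phase early stopping (illustrative sketch).

    All five callbacks are hypothetical and act on a shared model.
    Returns the recorded training log likelihood and the number of
    extra epochs run on the full training set.
    """
    # Phase 1: train on the first 50,000 examples until the validation
    # misclassification rate stops improving (assumed patience rule).
    best_error = float("inf")
    epochs_since_best = 0
    for _ in range(max_epochs):
        train_epoch_50k()
        error = valid_error()
        if error < best_error:
            best_error = error
            epochs_since_best = 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:
                break

    # Record the log likelihood on the first 50,000 examples.
    target = train_log_lik()

    # Phase 2: continue on all 60,000 examples until the validation
    # log likelihood first exceeds the recorded training value.
    extra_epochs = 0
    while valid_log_lik() <= target and extra_epochs < max_epochs:
        train_epoch_60k()
        extra_epochs += 1
    return target, extra_epochs
```

The test set is evaluated once, after the second phase
ends. The `max_epochs` guard is an added safety limit and
is not part of the procedure described in the text.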